31 research outputs found

    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    This paper introduces Tiramisu, a polyhedral framework designed to generate high-performance code for multiple platforms, including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra, and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model, and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithm, loop transformations, data layout, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and comparing them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
    Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
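
    To make the algorithm/schedule separation concrete, here is a minimal sketch in plain C++, not Tiramisu's actual scheduling API; all function and parameter names are illustrative assumptions. The same matrix-multiply algorithm is written once, and a tiled variant shows the kind of loop-level decision a scheduling language expresses separately from the algorithmic statement.

```cpp
// Sketch of the algorithm/schedule separation idea in plain C++.
// Names and structure are hypothetical, not Tiramisu identifiers.
#include <algorithm>
#include <cstddef>
#include <vector>

// "Algorithm": what to compute (C = A * B), with no ordering decisions.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// "Schedule": the same computation after tiling with tile size T -- the kind
// of transformation a scheduling language applies without touching the
// algorithm above.
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t N, std::size_t T) {
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t jj = 0; jj < N; jj += T)
            for (std::size_t kk = 0; kk < N; kk += T)
                for (std::size_t i = ii; i < std::min(ii + T, N); ++i)
                    for (std::size_t j = jj; j < std::min(jj + T, N); ++j)
                        for (std::size_t k = kk; k < std::min(kk + T, N); ++k)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```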

    A Framework for Customizable FPGA-based Image Registration Accelerators

    Image Registration is a highly compute-intensive optimization procedure that determines the geometric transformation to align a floating image to a reference one. Generally, the registration targets are images taken at different time instances, from different acquisition angles, and/or with different sensor types. Several methodologies are employed in the literature to address the limiting factors of this class of algorithms, among which hardware accelerators seem the most promising solution to boost performance. However, most hardware implementations are either closed-source or tailored to a specific context, limiting their application to different fields. For these reasons, we propose an open-source hardware-software framework to generate a configurable architecture for the most compute-intensive part of registration algorithms, namely the similarity metric computation. This metric is Mutual Information, a well-known measure from Information Theory used in several optimization procedures. Through different design parameter configurations, we explore several design choices of our highly customizable architecture and validate it on multiple FPGAs. We evaluated various architectures against an optimized Matlab implementation on an Intel Xeon Gold, reaching a speedup of up to 2.86x, as well as remarkable performance and power efficiency compared to other state-of-the-art approaches.
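
    For reference, here is a minimal software sketch of the Mutual Information kernel the accelerator targets: build a joint histogram of intensity pairs, then accumulate I(X;Y) = sum_{x,y} p(x,y) log(p(x,y) / (p(x) p(y))). The 256-bin count and 8-bit grayscale layout are simplifying assumptions, not details from the paper.

```cpp
// Software sketch of the Mutual Information similarity metric between a
// reference and a floating image (assumed equal-sized, 8-bit grayscale).
#include <cmath>
#include <cstdint>
#include <vector>

double mutual_information(const std::vector<uint8_t>& ref,
                          const std::vector<uint8_t>& flt) {
    constexpr int BINS = 256;
    std::vector<double> joint(BINS * BINS, 0.0), px(BINS, 0.0), py(BINS, 0.0);

    // Joint histogram of (reference, floating) intensity pairs.
    for (std::size_t i = 0; i < ref.size(); ++i)
        joint[ref[i] * BINS + flt[i]] += 1.0;

    // Normalize to probabilities and compute the marginals.
    const double n = static_cast<double>(ref.size());
    for (int x = 0; x < BINS; ++x)
        for (int y = 0; y < BINS; ++y) {
            joint[x * BINS + y] /= n;
            px[x] += joint[x * BINS + y];
            py[y] += joint[x * BINS + y];
        }

    // Accumulate the MI sum, skipping empty bins (p log p -> 0 as p -> 0).
    double mi = 0.0;
    for (int x = 0; x < BINS; ++x)
        for (int y = 0; y < BINS; ++y) {
            const double pxy = joint[x * BINS + y];
            if (pxy > 0.0)
                mi += pxy * std::log(pxy / (px[x] * py[y]));
        }
    return mi;  // in nats; divide by log(2) for bits
}
```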

    CICERO: A Domain-Specific Architecture for Efficient Regular Expression Matching

    Regular Expression (RE) matching is a computational kernel used in several applications. Since RE complexity and data volumes are steadily increasing, hardware acceleration is gaining attention for this problem as well. Existing approaches have limited flexibility, as they require a different implementation for each RE. On the other hand, it is complex to map efficient RE representations such as non-deterministic finite-state automata onto software-programmable engines or parallel architectures. In this work, we present CICERO, an end-to-end framework composed of a domain-specific architecture and a companion compilation framework for RE matching. Our solution is suitable for many applications, such as genomics/proteomics and natural language processing. CICERO aims at exploiting the intrinsic parallelism of non-deterministic representations of the REs. Thanks to its programmable architecture and compilation framework, CICERO can trade off accelerator efficiency against processor flexibility. We implemented CICERO prototypes on an embedded FPGA, achieving up to 28.6× and 20.8× higher energy efficiency than embedded and mainstream processors, respectively. Since it is a programmable architecture, it can also be implemented as a custom ASIC that is orders of magnitude more energy-efficient than mainstream processors.
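
    The intrinsic parallelism mentioned above comes from simulating the NFA by keeping a *set* of active states: every active state consumes each input character simultaneously. The sketch below shows this in software under assumed data structures (the transition-table layout and the '\0' epsilon marker are illustrative, not CICERO's actual encoding); in hardware, the per-state step is what runs on parallel engines.

```cpp
// Software sketch of set-of-active-states NFA simulation, the kernel a
// parallel RE-matching architecture accelerates.
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Nfa {
    // transitions[state] = list of (symbol, next_state); '\0' marks epsilon.
    std::vector<std::vector<std::pair<char, int>>> transitions;
    int start;
    std::set<int> accepting;
};

static void epsilon_closure(const Nfa& nfa, std::set<int>& states) {
    // Repeatedly add states reachable through epsilon edges.
    bool changed = true;
    while (changed) {
        changed = false;
        for (int s : std::set<int>(states))
            for (auto [sym, nxt] : nfa.transitions[s])
                if (sym == '\0' && states.insert(nxt).second)
                    changed = true;
    }
}

bool nfa_match(const Nfa& nfa, const std::string& input) {
    std::set<int> active{nfa.start};
    epsilon_closure(nfa, active);
    for (char c : input) {
        std::set<int> next;
        // Every active state consumes c; in hardware these per-state steps
        // proceed in parallel instead of sequentially.
        for (int s : active)
            for (auto [sym, nxt] : nfa.transitions[s])
                if (sym == c) next.insert(nxt);
        epsilon_closure(nfa, next);
        active = std::move(next);
    }
    for (int s : active)
        if (nfa.accepting.count(s)) return true;
    return false;
}
```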

    Large Forests and Where to “Partially” Fit Them

    The Artificial Intelligence of Things (AIoT) calls for on-site Machine Learning inference to overcome the instability in latency and availability of networks. Thus, hardware acceleration is paramount for reaching the Cloud's modeling performance within an embedded device's resources. In this paper, we propose Entree, the first automatic design flow for deploying the inference of Decision Tree (DT) ensembles on Field-Programmable Gate Arrays (FPGAs) at the network's edge. It exploits dynamic partial reconfiguration on modern FPGA-enabled Systems-on-a-Chip (SoCs) to accelerate arbitrarily large DT ensembles at a latency a hundred times more stable than software alternatives. Moreover, given Entree's suitability for both hardware designers and non-hardware-savvy developers, we believe it has the potential to help data scientists develop a non-Cloud-centric AIoT.
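
    To fix ideas, here is a minimal software sketch of the DT-ensemble inference that such a design flow maps to hardware: each tree is a flat node array walked from the root, and the ensemble output is a majority vote. The node layout and voting scheme are illustrative assumptions, not Entree's actual implementation.

```cpp
// Software sketch of decision-tree ensemble inference.
#include <cstddef>
#include <map>
#include <vector>

struct Node {
    int feature;      // feature index to test; -1 marks a leaf
    float threshold;  // go left if x[feature] <= threshold
    int left, right;  // child indices within the same tree
    int label;        // class label, valid only at leaves
};

using Tree = std::vector<Node>;

int predict_tree(const Tree& tree, const std::vector<float>& x) {
    int n = 0;  // start at the root
    while (tree[n].feature != -1)
        n = (x[tree[n].feature] <= tree[n].threshold) ? tree[n].left
                                                      : tree[n].right;
    return tree[n].label;
}

int predict_ensemble(const std::vector<Tree>& forest,
                     const std::vector<float>& x) {
    // On an FPGA, ensembles too large for one bitstream can be swapped in
    // and out via dynamic partial reconfiguration; in software we just vote.
    std::map<int, int> votes;
    for (const Tree& t : forest) ++votes[predict_tree(t, x)];
    int best = -1, best_count = -1;
    for (auto [label, count] : votes)
        if (count > best_count) { best = label; best_count = count; }
    return best;
}
```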

    A highly scalable and efficient parallel design of N-body simulation on FPGA

    An N-body simulation models the evolution of a system composed of N particles, where each element is subject to a force arising from its interaction with all the other elements of the system. Usually, external physical forces, such as gravity, are involved as well. This methodology is widely used in fields ranging from astrophysics, where it is used to study the interaction of celestial objects, to molecular dynamics, where the bodies represent molecules. Despite its wide range of applicability, the algorithm has a high computational complexity that requires powerful and power-hungry computers. Acceleration on a reconfigurable device, such as an FPGA, benefits both performance and power consumption. In this work, we present a scalable, high-performance, and highly efficient implementation of an N-body simulation algorithm on FPGA. The final design outperforms both CPU and FPGA works from the state of the art in terms of pure performance by a factor of up to 10x, and high-end GPUs in terms of performance per watt by a factor of 1.84x.
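
    As a software reference for the accelerated kernel, the sketch below implements the all-pairs gravitational interaction that gives the algorithm its O(N^2) complexity: every particle accumulates the force from every other particle, then positions are advanced. The softening term, integrator, and data layout are simplifying assumptions, not details of the paper's design.

```cpp
// All-pairs N-body step (semi-implicit Euler), the O(N^2) kernel an FPGA
// design pipelines and replicates.
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle {
    double x, y, z;     // position
    double vx, vy, vz;  // velocity
    double mass;
};

void nbody_step(std::vector<Particle>& p, double dt, double G = 6.674e-11,
                double softening = 1e-9) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        // Inner loop: the pairwise interaction; softening avoids division
        // by zero for nearly coincident particles.
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + softening;
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            ax += G * p[j].mass * dx * inv_r3;
            ay += G * p[j].mass * dy * inv_r3;
            az += G * p[j].mass * dz * inv_r3;
        }
        p[i].vx += ax * dt;
        p[i].vy += ay * dt;
        p[i].vz += az * dt;
    }
    // Update positions only after all forces for this step are applied.
    for (auto& q : p) {
        q.x += q.vx * dt;
        q.y += q.vy * dt;
        q.z += q.vz * dt;
    }
}
```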

    A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA

    A Convolutional Neural Network (CNN) is a deep learning algorithm extended from the Artificial Neural Network (ANN) and widely used for image classification and recognition thanks to its invariance to distortions. The recent rapid growth of applications based on deep learning algorithms, especially in the context of Big Data analytics, has dramatically spurred both industrial and academic research into optimized implementations of CNNs on accelerators such as GPUs, FPGAs, and ASICs, as general-purpose processors can hardly meet the ever-increasing performance and energy-efficiency requirements. FPGAs in particular are one of the most attractive alternatives, as they allow the exploitation of the implicit parallelism of the algorithm and the acceleration of the different layers of a CNN with custom optimizations, while retaining extreme flexibility thanks to their reconfigurability. In this work, we propose a methodology to implement CNNs on FPGAs in a modular, scalable way. This is done by exploiting the dataflow pattern of convolutions, using an approach derived from previous work on the acceleration of Iterative Stencil Loops (ISLs), a computational pattern that shares some characteristics with convolutions. Furthermore, this approach allows the implementation of a high-level pipeline between the different network layers, increasing overall performance when the CNN processes batches of multiple images, as happens in real-life scenarios.
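
    To show why convolutions and stencils share a dataflow pattern, here is a minimal direct 2D convolution: each output element reads a fixed KxK window of the input, exactly the sliding-window access pattern of an iterative stencil loop. A single channel and "valid" (no-padding) output are simplifying assumptions for the sketch.

```cpp
// Direct single-channel 2D convolution ("valid" output), illustrating the
// stencil-like KxK window access pattern a dataflow design streams through
// line buffers.
#include <cstddef>
#include <vector>

std::vector<float> conv2d(const std::vector<float>& in, std::size_t H,
                          std::size_t W, const std::vector<float>& kernel,
                          std::size_t K) {
    const std::size_t outH = H - K + 1, outW = W - K + 1;
    std::vector<float> out(outH * outW, 0.0f);
    for (std::size_t r = 0; r < outH; ++r)
        for (std::size_t c = 0; c < outW; ++c) {
            float acc = 0.0f;
            // KxK window: the same neighborhood access as a stencil loop,
            // which is why ISL-derived acceleration techniques carry over.
            for (std::size_t kr = 0; kr < K; ++kr)
                for (std::size_t kc = 0; kc < K; ++kc)
                    acc += in[(r + kr) * W + (c + kc)] * kernel[kr * K + kc];
            out[r * outW + c] = acc;
        }
    return out;
}
```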